A Small Database with an Adaptive Data Selection Method for Solder Joint Fatigue Life Prediction in Advanced Packaging

There has always been high interest in predicting the solder joint fatigue life in advanced packaging with high accuracy and efficiency. Artificial Intelligence Plus (AI+) is becoming increasingly popular as computational facilities continue to develop. This study will introduce machine learning (a core component of AI). With machine learning, metamodels that approximate the attributes of systems or functions are created to predict the fatigue life of advanced packaging. However, the prediction ability is highly dependent on the size and distribution of the training data. Increasing the amount of training data is the most intuitive approach to improve prediction performance, but this implies a higher computational cost. In this research, the adaptive sampling methods are applied to build the machine learning model with a small dataset sampled from an existing database. The performance of the model will be visualized using predefined criteria. Moreover, ensemble learning can be used to improve the performance of AI models after they have been fully trained.


Introduction
As the demand for electronic products continues to rise, electronic packaging gradually moves towards miniaturization and high density. Wafer-level chip scale packaging (WLCSP) offers a significant advantage in effectively reducing the package size. Analyzing solder ball reliability is crucial for assessing the long-term performance of WLCSP. The difference in the coefficient of thermal expansion (CTE) between various materials is a major factor that affects the reliability of WLCSP. In the accelerated thermal cycling test (ATCT), the failure mainly happens at the corner of the solder ball with the greatest Distance to Neutral Point (DNP) [1].
Although experiments can provide reliable results, they require a significant amount of time, are costly, and are environmentally unfriendly; therefore, the design-by-experiment approach is not suitable for electronic packaging. In order to reduce the number of ATCT experiments and complete the initial design of packaging structures, finite element analysis (FEA) is generally used. A 3D finite element model of a wafer-level chip scale package (WLCSP) was developed by Liu et al. [2]. The incremental value of equivalent plastic strain for each cycle was calculated. Once this incremental value stabilized, it was incorporated into the Coffin-Manson equation to predict solder ball reliability, and the simulation results were close to those obtained in experiments. However, 3D finite element simulation can be time-consuming and involves many detailed models. According to Tsou [3], for symmetric packaging structures, a 2D finite element model can also match the experiment result by adjusting the mesh size of key regions, and the simulation time can be significantly reduced. The simulation results are shown in Figure 1. The use of simulations to replace experiments has many benefits, but it also presents challenges: the development of an effective simulation process requires expertise in domain knowledge and finite element theory, and the results obtained by different researchers may differ.
One effective approach to address the abovementioned difficulties is introducing machine learning. Due to the nonlinear modeling capabilities of AI models, they can capture the response of the FEM with high accuracy. Hence, a well-trained AI model enables researchers to obtain the desired results quickly. Figure 2 shows the workflow called AI-assisted Design of Simulation [4]. This chart shows that both experiments and simulations can contribute to building the dataset for WLCSP. As many simulation results need to be generated for AI training, the 3D finite element model is too time-consuming and is less suitable than a 2D WLCSP model. Validated 2D models are the basis for generating our raw database, and the details of this model will be presented in the sections below. Once the dataset has been validated, the learning algorithm will be used to obtain the trained model. As soon as the trained model has been obtained, the designer can enter the information about the structure of the package, and the reliability life of the WLCSP can be immediately determined.
The key factors of the training dataset, including the quality and quantity, directly impact the AI model's performance [5]. Once the feature space of the AI model has been determined, there are generally two strategies for sampling data: the space-filling method and the adaptive method. Figure 3 illustrates a bounded 2D feature space. The space-filling technique in this figure ensures that the distribution of data points is always uniform. By contrast, adaptive sampling techniques position a certain proportion of data points in the high-interest area. If the cost of data acquisition is not high, the space-filling design is the first choice: sample as many data points as possible to fill the entire feature space. Using a large training dataset can significantly improve the performance of AI models, as demonstrated quite intuitively in the previously published results of our lab [6]. Since future work will involve more design parameters, relying only on space-filling techniques will require an exponentially increasing number of data points to stabilize the prediction performance of AI models. It is therefore important to investigate small data training based on adaptive sampling.
Additionally, the choice of machine learning algorithms is crucial. Referring to the process depicted in Figure 2, adaptive design is incorporated into the data generation module; this research selects an artificial neural network (ANN) and introduces ensemble learning in the training algorithm module. Dong et al. [9] conducted a comprehensive review of mainstream approaches for ensemble learning. In essence, ensemble learning combines multiple weak supervised models to achieve a stronger and more comprehensive supervised model. Ensemble learning is treated here as an extra machine learning strategy: this research focuses on data generation and does not delve extensively into ensemble learning.
In summary, to reduce data sampling costs and model training time, this research will explore the feasibility of accurately predicting the reliability of advanced packaging using a small dataset. In fact, small data analysis has been applied across various fields. Izonin et al. [10] improved the performance of small data analysis in the field of biomedical engineering using improved neural network techniques. Zhang [11] achieved precise prediction of the target problems in the field of materials science by incorporating prior knowledge and using Kernel Ridge Regression (KRR) in cases with small datasets. These studies often focus on optimizing learning algorithms due to the uncontrollable nature of data acquisition. This paper, however, focuses on data sampling techniques because it uses simulation to generate datasets. The workflow of this study is optimized based on the process shown in Figure 2. Adaptive sampling techniques will be used to establish the dataset for WLCSP. The learning algorithm will consistently use ANNs, and the trained models will be combined into an ensemble to predict the final reliability.

Coffin-Manson Model
The Coffin-Manson equation [12] is an empirical strain-based method for predicting the fatigue life of packaging structures. Its expression is as follows:
$$N_f = \alpha \left( \Delta\varepsilon_{eq}^{pl} \right)^{\phi}$$
where $N_f$ is the mean number of cycles to failure (mean time to failure, MTTF), and $\alpha$ and $\phi$ are empirical constants, typically obtained through curve fitting. $\Delta\varepsilon_{eq}^{pl}$ represents the incremental equivalent plastic strain, whose definition follows [13].
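As a minimal sketch of how this relation is evaluated, the function below assumes the power-law form N_f = α(Δε)^ϕ, which is consistent with the symbols named in the text. The function name and the numeric constants are hypothetical placeholders for illustration and are not the fitted values used in this study.

```python
def coffin_manson_life(delta_eps_pl: float, alpha: float, phi: float) -> float:
    """Cycles to failure from the assumed form N_f = alpha * delta_eps_pl ** phi."""
    if delta_eps_pl <= 0.0:
        raise ValueError("incremental equivalent plastic strain must be positive")
    return alpha * delta_eps_pl ** phi


# Hypothetical constants, chosen only to illustrate the trend (not from the paper).
ALPHA_EXAMPLE = 0.5
PHI_EXAMPLE = -1.5  # a negative exponent: larger strain increment, shorter life

life_small_strain = coffin_manson_life(0.005, ALPHA_EXAMPLE, PHI_EXAMPLE)
life_large_strain = coffin_manson_life(0.010, ALPHA_EXAMPLE, PHI_EXAMPLE)
```

With a negative exponent, doubling the strain increment shortens the predicted life by a factor of 2^|ϕ|, which is why the stabilized strain increment from the simulation dominates the life estimate.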

Artificial Neural Network
Machine learning can generally be divided into two categories: supervised learning and unsupervised learning. Essentially, the difference between these two methods is whether the targets in the training dataset are labeled. In this study, the task is to predict the reliability life of the packaging; because both the input and the output are defined, a supervised learning algorithm is required. Considering both training time and predictive accuracy [14], the ANN algorithm is the initial choice for assessing the quality of the small training set. Once the training set is finalized, the algorithm selection will be double-checked.
An ANN, proposed by McCulloch and Pitts [15], simulates how human neurons transmit information to connect inputs and outputs.The schematic illustration of its structure is shown in

To perform weight updates, an ANN first needs an appropriate solver. This study chose two widely used solvers: Adaptive Moment Estimation (Adam) and Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS). Adam [17] is an optimization algorithm that combines the momentum method and RMSProp, offering the advantages of adaptive learning rates and momentum. L-BFGS [18] is a quasi-Newton optimization algorithm suitable for large-scale optimization problems. It achieves efficient second-order optimization by approximating the Hessian matrix from a limited history of gradients.
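To make the Adam description concrete, the following sketch applies the Adam update to a single scalar parameter. It illustrates the update rule only and is not the training code of this study; the function name, the quadratic test objective, and the hyperparameter values (commonly used defaults) are assumptions.

```python
import math


def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a 1D function with Adam, given its gradient function `grad`."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g        # momentum: running mean of gradients
        v = beta2 * v + (1.0 - beta2) * g * g    # RMSProp part: running mean of squared gradients
        m_hat = m / (1.0 - beta1 ** t)           # bias correction for the zero initialization
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return x


# Toy objective f(x) = (x - 3)^2 with gradient 2 (x - 3); the minimum is at x = 3.
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

The first moment `m` supplies the momentum and the second moment `v` rescales the step size, which is the combination the text attributes to Adam.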

K-Medoids
Like K-Means, the K-medoids algorithm [19] is a clustering algorithm. Basically, clustering is the process of dividing a dataset into different classes or clusters based on a specific criterion (such as distance), where the data points within each cluster exhibit high similarity, and the similarity between data points in different clusters is low. In accordance with the machine learning classification discussed earlier, K-medoids belongs to the category of unsupervised learning.
As listed in Table 1, the essential difference between the two algorithms is that the cluster centers are actual data points in K-medoids, whereas, in K-Means, the cluster centers are computed virtual points. A clustering algorithm maximizes the dissimilarity between clusters, and the cluster centers represent the most prominent features in the local feature space. Since the K-medoids algorithm uses actual points as cluster centers, it can be used to select a certain number of cluster centers from an existing database as our training set. K-Means could be used to establish a new database when starting from scratch. However, because its cluster centers are derived from average values, the feature values would need to be rounded, so it is not recommended here.
Here is a brief description of the general steps of the K-medoids algorithm:
1. Initialization: Select k data points as initial cluster centers (specified or randomly picked);
2. Cluster Assignment: Assign each data point to the cluster with the nearest cluster center;
3. Center Update: For each cluster, choose a new cluster center that minimizes the sum of distances to all other points in the cluster;
4. Repeat steps 2 and 3 until the cluster centers stabilize or the maximum number of iterations is reached.
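The four steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions (Euclidean distance, random initialization, distinct data points), not the implementation used in this study; the function name `k_medoids` and the toy points are hypothetical.

```python
import math
import random


def k_medoids(points, k, max_iter=100, seed=0):
    """Return the sorted indices of k medoids (assumes all points are distinct)."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)          # step 1: initialization
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):                   # step 2: cluster assignment
            nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
            clusters[nearest].append(i)
        new_medoids = []
        for members in clusters.values():                # step 3: center update
            best = min(
                members,
                key=lambda c: sum(math.dist(points[c], points[j]) for j in members),
            )
            new_medoids.append(best)
        if set(new_medoids) == set(medoids):             # step 4: stop when stable
            break
        medoids = new_medoids
    return sorted(medoids)


# Two well-separated toy groups; the returned medoids are indices of actual points.
toy_points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0), (10.5, 10.5),
]
selected = k_medoids(toy_points, k=2)
```

Because the function returns indices into the input list, the selected medoids can be used directly as training samples from an existing database, with no rounding of feature values.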
Finally, the differences between the two algorithms are shown in Figure 6. As mentioned earlier, medoids are all actual sample points.

| Algorithm | Cluster Center | Update Method |
| K-Medoids | Actual sample point | Minimization of the sum of distances |
| K-Means | Virtual point | The average value of the data points |

Ensemble Learning
Small data training becomes essential when training and acquiring data are costly. The variance of ANN-trained models is more significant when the dataset is small. In addition to using adaptive methods to optimize the distribution of training data, ensemble learning can also improve AI models' predictive accuracy.
In ensemble learning, multiple trained models are combined to improve the performance of the model predictions. While this research will examine this method as an additional way of improving model performance, it will not be explored in depth. Generally, ensemble learning can be divided into three categories: bagging, boosting, and stacking. In this study, only the results of bagging are presented.
As proposed by Breiman [20], bagging is a simple and efficient method. There are three main components to this method: training data, ensemble models, and combination. For the ensemble models, different ANN models are obtained by varying the number of hidden layers and neurons. For the combination, each ANN model is given equal weight, and the final prediction is the average of all ANN models. Figure 7 illustrates the schematic diagram of the structure. As for the training data, bagging generally varies the data used to train each model in the ensemble. In this study, ensemble learning is not the focus. Because the aim is only to validate the effectiveness of ensemble learning, all models will be trained using the same training set.
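The three components named above can be sketched as follows. This is an illustration rather than the study's code: a least-squares line stands in for each ANN, and the bootstrap resampling shown is the generic Breiman-style step, whereas in this study all members share one training set and only the equal-weight averaging applies. All function names and the toy data are hypothetical.

```python
import random
from statistics import mean


def fit_line(samples):
    """Least-squares line through (x, y) pairs; a stand-in for one trained model."""
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample: fall back to a constant predictor
        return lambda x: y_bar
    slope = sum((x - x_bar) * (y - y_bar) for x, y in samples) / sxx
    return lambda x: y_bar + slope * (x - x_bar)


def bagging_ensemble(samples, n_models, seed=0):
    """Train n_models members, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    return [fit_line(rng.choices(samples, k=len(samples))) for _ in range(n_models)]


def ensemble_predict(models, x):
    """Combination step: every member gets equal weight, so average the outputs."""
    return mean(model(x) for model in models)


toy_data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
models = bagging_ensemble(toy_data, n_models=5)
prediction = ensemble_predict(models, 3.0)
```

Averaging reduces the variance contributed by any single member, which is the property that matters most when the training set is small.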

Validation of FEA Model
A finite element analysis (FEA) model must be validated before constructing a simulation database. A 2D finite element model of the half-diagonal section of the WLCSP will be developed in this study to reduce simulation time. Figures 8 and 9 provide a schematic diagram of its structure. Figure 9 shows that the x-direction displacement at the left position is fixed to zero, while the y-direction displacement at the lower left corner of the model is also fixed to prevent rigid body motion. We set the following basic assumptions to maintain a balance between simulation efficiency and accuracy:
• It is assumed that all materials in the structure are both isotropic and homogeneous;
• The temperature is assumed to be isothermal throughout the structure;
• Residual stress is not taken into account;
• All material interfaces within the structure are assumed to be perfectly bonded.
FEA models will be built based on the geometric dimensions of the test vehicles. The FEA model includes the following components: silicon chip; low-k layer; stress buffer layer (SBL); under bump metallurgy (UBM); redistribution layer (RDL); printed circuit board (PCB); copper pad; solder mask; and solder ball. The Surface Evolver software V2.70 will be used to estimate the geometric dimensions of the solder balls in the structure [21] and to extract the coordinate information of key nodes on the solder ball for FEA. For details, see Figure 10. Figures 11 and 12 illustrate the details of the FEA model.
As shown in Figure 12, stable simulation results can be obtained by controlling the mesh size at the top right corner of the solder ball. Generally, the first failure occurs at the top right corner of the solder ball, where maximum DNP is present. The mesh sizes Tsou [3] proposed will be adopted: 12.5 µm in the X direction and 7.5 µm in the Y direction. Except for the solder ball, all other materials are set as linear elastic; their properties are shown in Table 2. In this study, the material of the solder balls is SAC305. Young's modulus of solder SAC305 is temperature-dependent, and its mechanical properties exhibit significant nonlinearity. The Chaboche kinematic hardening model is utilized to describe its nonlinearity at different temperatures. As shown in Figure 13, the stress-strain curves for SAC305 were obtained by performing uniaxial tensile tests on the material [22]. These data will be used for curve fitting to obtain the Chaboche model parameters: σ0, the initial yield stress, and the empirical coefficients C and γ. The parameter fitting results are shown in Table 3.
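For orientation, the sketch below evaluates one commonly used uniaxial form of the Chaboche model with a single back-stress term, σ = σ0 + (C/γ)(1 − exp(−γ ε_pl)). This form is an assumption made for illustration, chosen because it uses exactly the three parameters named in the text; the function name and the numeric values are hypothetical and are not the fitted values of Table 3.

```python
import math


def chaboche_uniaxial_stress(eps_pl, sigma0, c, gamma):
    """Stress under monotonic uniaxial loading for one back-stress term (assumed form)."""
    if gamma <= 0.0:
        raise ValueError("gamma must be positive")
    back_stress = (c / gamma) * (1.0 - math.exp(-gamma * eps_pl))
    return sigma0 + back_stress


# Hypothetical parameters in MPa, for illustration only.
SIGMA0_EXAMPLE = 30.0
C_EXAMPLE = 5000.0
GAMMA_EXAMPLE = 500.0

curve = [
    (eps, chaboche_uniaxial_stress(eps, SIGMA0_EXAMPLE, C_EXAMPLE, GAMMA_EXAMPLE))
    for eps in (0.0, 0.002, 0.005, 0.01, 0.05)
]
```

In this form the stress starts at σ0 and saturates at σ0 + C/γ, so fitting C and γ at each temperature controls both the initial hardening slope and the plateau of the stress-strain curve.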
Thermal cycling loads are applied to the WLCSP model according to JEDEC standard Condition G, with temperatures ranging from −40 to 125 °C. The thermal cycling load uses a dwell time of 10 min and a heating and cooling rate of 16.5 °C/min, so one cycle takes 40 min. See Figure 14 for details.
Using the Coffin-Manson equation, the simulated value of packaging reliability can be obtained with incremental equivalent plastic strain. Table 4 compares the mean time to failure (MTTF) of the experiments [23,24] and simulation results for five test vehicles (TVs). This table shows that the differences between simulated and experimental values are within an acceptable range (<10%). Therefore, the fatigue life of the packages can be accurately predicted. Based on the validated FEA models, theories, material constitutive equations, solution procedures, etc., the database will be built for machine learning. To generate the database, the material parameters, boundary conditions, and temperature load are fixed. Only the geometric dimensions of the WLCSP model are changed. The AI model will be trained using the database.
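The 40 min cycle time quoted for the thermal load follows directly from the stated temperature range, ramp rate, and dwell time. The sketch below reproduces that arithmetic and builds a piecewise-linear profile. The ordering of the segments (ramp up, hot dwell, ramp down, cold dwell) is an assumption, since the profile figure is not reproduced here, and the function names are hypothetical.

```python
T_MIN_C = -40.0
T_MAX_C = 125.0
RAMP_RATE_C_PER_MIN = 16.5
DWELL_MIN = 10.0

RAMP_MIN = (T_MAX_C - T_MIN_C) / RAMP_RATE_C_PER_MIN   # 165 / 16.5 = 10 min per ramp
CYCLE_MIN = 2.0 * RAMP_MIN + 2.0 * DWELL_MIN            # two ramps plus two dwells


def temperature_at(minutes: float) -> float:
    """Temperature in deg C at a given elapsed time, assuming the cycle starts at T_MIN_C."""
    t = minutes % CYCLE_MIN
    if t < RAMP_MIN:                                     # heating ramp
        return T_MIN_C + RAMP_RATE_C_PER_MIN * t
    if t < RAMP_MIN + DWELL_MIN:                         # dwell at the upper temperature
        return T_MAX_C
    if t < 2.0 * RAMP_MIN + DWELL_MIN:                   # cooling ramp
        return T_MAX_C - RAMP_RATE_C_PER_MIN * (t - RAMP_MIN - DWELL_MIN)
    return T_MIN_C                                       # dwell at the lower temperature
```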

Data Sampling and Model Training
To demonstrate the feasibility of small data training, this study will build the database and extract small samples from it. The database establishment, data sampling, and model training will be described in detail.

Establish Database
Selecting features is typically the first step in establishing a database. Researchers need to choose features that are highly correlated with the predictive target [25], which often requires expert knowledge or numerical analysis. Figure 15 illustrates features highly correlated with WLCSP fatigue life based on expert knowledge. As an extension of the previous research [14], this study uses the same four features, and the database is built on TV2. The four features are as follows: upper pad diameter, lower pad diameter, SBL thickness, and chip thickness.
Following the concept of space-filling in previous studies, we used as many data points as possible to fill the entire feature space evenly, which gave good performance in training the AI model. The database can be obtained by determining the value boundaries of the features and selecting node values for complete permutation and combination, as shown in Tables 5-10. This is the simplest form of space-filling. In total, there are over 9000 data points. We selected a certain proportion of them as the training set, and the performance of the AI model improves with increased training data. Previous studies did not introduce additional sampling strategies and just used random sampling. This research will introduce the adaptive sampling method that will reduce the training data while maintaining the performance of the model.
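The "complete permutation and combination" of node values described above is a full-factorial grid over the four features. A minimal sketch is given below; the node values and key names are hypothetical placeholders, not the levels used to build the actual database.

```python
from itertools import product

# Hypothetical node values (in micrometers) for the four features named in the text.
feature_levels = {
    "upper_pad_diameter_um": [180, 200, 220],
    "lower_pad_diameter_um": [180, 200, 220],
    "sbl_thickness_um": [5, 10],
    "chip_thickness_um": [150, 250, 350, 450],
}


def build_full_factorial(levels):
    """Return one design point (a dict) for every combination of node values."""
    names = list(levels)
    return [dict(zip(names, combo)) for combo in product(*(levels[n] for n in names))]


database = build_full_factorial(feature_levels)
n_points = len(database)  # 3 * 3 * 2 * 4 combinations
```

The number of points is the product of the level counts, so each additional feature or level multiplies the number of simulations required. This is the growth that motivates the smaller, adaptively sampled training set.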

Data Sampling (Random Pick)
In Section 4.3, K-medoids will be used for data sampling. As a comparison group, 200 samples were randomly selected for training. Their distribution is visualized in Figure 16. The three axes of this figure represent the upper pad diameter, lower pad diameter, and chip thickness, while the color bar represents the SBL thickness. The quality of the data distribution cannot be judged directly from this figure; it has to be evaluated through the performance of the AI models. The 200 data points are used as the training set, while the remaining data points serve as the testing set. The performance of the AI model on the testing set is the basis for evaluation.
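The random-pick baseline described above can be sketched as follows. The database size of 9000 is a placeholder (the text only states "over 9000"), and the function name and the fixed seed are assumptions made for reproducibility of the illustration.

```python
import random


def random_pick_split(n_points, n_train=200, seed=0):
    """Randomly choose n_train indices for training; the rest form the test set."""
    rng = random.Random(seed)
    train_idx = sorted(rng.sample(range(n_points), n_train))
    train_set = set(train_idx)
    test_idx = [i for i in range(n_points) if i not in train_set]
    return train_idx, test_idx


# Placeholder database size; the actual database holds "over 9000" points.
train_idx, test_idx = random_pick_split(n_points=9000)
print(len(train_idx), len(test_idx))
```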
It is worth mentioning that in classification problems, the decision boundary is often the focal point of research. Guan [26] revealed the impact of decision boundary complexity on model generalization. A commonly used method to address generalization issues is boundary sampling, in which data points close to the decision boundary are identified within the existing dataset and additional samples are generated near the boundary using interpolation or other techniques [27]. Although there are no decision boundaries in regression problems, the performance of AI regression models near feature boundaries is also worth exploring.
In this study, the essence of adaptive sampling lies in sampling near feature boundaries. Its impact on AI model performance will be further investigated in Section 4.3.
To assess the performance of AI models, standards need to be established. These include the maximum training difference, the average training difference, the maximum testing difference, the average testing difference, the number of testing data points with a "difference > 50 cycles", and the number of testing data points with a "difference percentage > 7%". A training difference indicates whether the model is underfitted, whereas a testing difference indicates whether it is overfitted. The other two standards give an intuitive count of the test points with inaccurate predictions.
The preliminary preparation for model training is now complete. The ANN learning algorithm is used with 200 training data points and over 9000 testing data points. The hyperparameter settings for the ANN are shown in Table 11. As mentioned in Section 2.2, the learning rate of Adam is adaptive. The model is simple, with only four inputs and one output, so no elaborate techniques are required. The hyperparameter settings and the choice of data preprocessing were determined based on previous experience [14]. Here, data preprocessing specifically refers to data transformation, a method of adjusting the range of feature values that can optimize model performance [28]. The robust scaler is the best choice in this case. It uses quartiles for data standardization, as shown in Equation (5). Grid search is used to find the optimal combination of neuron numbers for each layer. The maximum number of iterations is set as the condition to terminate model updates.
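The evaluation standards listed above can be computed with a few lines of code. The sketch below is illustrative: the function name is hypothetical, the numbers are made up, and the percentage difference is assumed to be taken relative to the target value, which the text does not state explicitly.

```python
def evaluate(predictions, targets, abs_limit=50.0, pct_limit=7.0):
    """Summarize prediction differences (in cycles) against the stated limits."""
    abs_diff = [abs(p - t) for p, t in zip(predictions, targets)]
    # Assumption: percent difference is measured relative to the target.
    pct_diff = [100.0 * d / abs(t) for d, t in zip(abs_diff, targets)]
    return {
        "max_difference": max(abs_diff),
        "average_difference": sum(abs_diff) / len(abs_diff),
        "count_abs_over_limit": sum(d > abs_limit for d in abs_diff),
        "count_pct_over_limit": sum(p > pct_limit for p in pct_diff),
    }


# Made-up predictions and targets, for illustration only.
report = evaluate([1000.0, 940.0, 1210.0], [1010.0, 1000.0, 1200.0])
print(report)
```

Applying the same function to the training set and to the testing set gives the training and testing differences, respectively.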
Two sets of results are listed in Table 12. "Neuron number" indicates the number of neurons in each hidden layer. These two models were selected from the many models generated by the grid search because they exhibit good performance in testing differences. The data corresponding to the "Maximum difference" in the table are, in order, the prediction, target, absolute difference, and percent difference. The training differences of both models indicate that they have been sufficiently trained and that there is no underfitting. The testing differences indicate that the models were not accurate in predicting unknown data. In both cases, the maximum percentage of testing errors exceeds 15%, and the number of test data points with inaccurate predictions (difference ≥ 50 cycles) exceeds 200. There is no doubt that the inadequacy of the training data is a contributing factor to the poor performance of the model. The small training set obtained through random sampling needs further optimization, either by increasing its size or by improving its distribution.
Increasing the number of data points in the regions of high interest is one of the most direct methods for improving the distribution of data, and adaptive sampling refers to this method. The method relies on prior knowledge, meaning that the locations of the high-interest regions must be known. As previously mentioned, boundary sampling is noteworthy but requires further validation. It is first necessary to observe whether the inaccurately predicted test points tend to cluster near the feature boundaries. A clear clustering of these test points would indicate that the clustered regions are high-interest areas.
The detailed results are shown in Table 13. Among the 319 inaccurately predicted test points for Model I, 220 have an SBL thickness of less than 10.5 µm, a substantial share of about 69%. For the upper pad diameter, 153 data points (48%) lie at a boundary value. The situation for the lower pad diameter and the chip thickness is similar to that for the upper pad diameter, so except for SBL thickness, the results for the features are very similar, with proportions close to one half. It is evidently necessary to increase the number of training data points near the feature boundaries and in regions with a small SBL thickness.
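The kind of count reported in Table 13 can be sketched as a simple tally over the inaccurately predicted test points. The feature keys, boundary values, and sample points below are hypothetical; only the 10.5 µm threshold comes from the text.

```python
def boundary_share(points, feature, lower, upper):
    """Fraction of points whose `feature` equals either boundary value."""
    on_boundary = sum(1 for p in points if p[feature] in (lower, upper))
    return on_boundary / len(points)


def thin_sbl_share(points, threshold=10.5):
    """Fraction of points with an SBL thickness below the threshold (in µm)."""
    return sum(1 for p in points if p["sbl"] < threshold) / len(points)


# Made-up inaccurately predicted points, for illustration only.
bad_points = [
    {"upper_pad": 180, "sbl": 5},
    {"upper_pad": 200, "sbl": 12},
    {"upper_pad": 240, "sbl": 8},
    {"upper_pad": 220, "sbl": 9},
]
upper_share = boundary_share(bad_points, "upper_pad", lower=180, upper=240)
sbl_share = thin_sbl_share(bad_points)
print(upper_share, sbl_share)
```

For the figures quoted above, the same ratios are 220/319 ≈ 0.69 for thin SBL and 153/319 ≈ 0.48 for the upper pad boundary.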

Data Sampling (Adaptive Method)
The results in Section 4.3 are consistent with our expectations: about half of the inaccurate predictions are located at the feature boundaries. Just as data points near the decision boundary are prone to classification errors, data points near the feature boundaries also carry a higher risk of inaccurate predictions. Next, K-medoids will be used to perform feature boundary sampling. Since the dissimilarity between clusters, the uniformity of each cluster, and the space-filling principle are guaranteed by the clustering algorithm, the cluster centers are chosen to represent the characteristics of the clusters.
To increase the training data near the feature boundaries uniformly, the feature space is split first. Comparative testing showed that partitioning the space on four features rather than three has a limited impact on the final prediction performance. This study therefore provides a focused analysis of one case.
By dividing the upper pad, lower pad, and SBL into two sets each and taking all combinations, the entire feature space is divided into eight parts (2 × 2 × 2). As shown in Table 14, set 1 for the upper and lower pads contains their upper and lower boundary values. Referring to Table 13, we set 10.5 as the dividing value for the two sets of the SBL. Using K-medoids, we generated 25 cluster centers in each of the eight regions, for a total of 200 training data points.
--: Not involved in the split.
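The eight-way split of the feature space described above can be sketched as follows. The feature keys and node values are hypothetical, and two details are assumptions rather than statements from the text: set 2 is taken to be every non-boundary pad value, and an SBL thickness exactly equal to 10.5 is placed in the thicker set.

```python
from collections import defaultdict
from itertools import product


def region_key(point, pad_bounds, sbl_split=10.5):
    """Label a point by (upper pad on boundary, lower pad on boundary, thin SBL)."""
    upper_on_boundary = point["upper_pad"] in pad_bounds["upper_pad"]
    lower_on_boundary = point["lower_pad"] in pad_bounds["lower_pad"]
    thin_sbl = point["sbl"] < sbl_split
    return (upper_on_boundary, lower_on_boundary, thin_sbl)


def partition(points, pad_bounds, sbl_split=10.5):
    """Group points into at most eight regions keyed by region_key."""
    regions = defaultdict(list)
    for p in points:
        regions[region_key(p, pad_bounds, sbl_split)].append(p)
    return dict(regions)


# Hypothetical node values; chip thickness is not involved in the split.
pads = [180, 200, 220, 240]
sbls = [5, 10, 15, 20]
points = [
    {"upper_pad": u, "lower_pad": l, "sbl": s}
    for u, l, s in product(pads, pads, sbls)
]
pad_bounds = {"upper_pad": (180, 240), "lower_pad": (180, 240)}
regions = partition(points, pad_bounds)
print({key: len(members) for key, members in regions.items()})
```

Running a clustering step with 25 centers inside each of the eight regions then yields the 8 × 25 = 200 training points mentioned above.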
Table 15 displays the key configuration parameters of K-medoids. "Metric" specifies the measure used to compute distances between data points, with Euclidean distance being the most widely used. "Method" specifies the approach used for clustering; "Alternate" was chosen based on time cost considerations. "K-medoids++" is an initialization method that ensures the cluster centers start sufficiently far apart, which facilitates quicker convergence to better clustering outcomes. "Random_state" is set to make the random results reproducible, and "Max_iter" is set to 500. The distribution of the new training data is shown in Figure 17. Compared with Figure 16, Figure 17 shows a significant increase in the number of data points on the boundary surfaces. In addition, the number of blue data points, which have a small SBL thickness, has increased. The AI models are trained with the new training data in Section 4.4.
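To make the roles of these parameters concrete, the sketch below implements an alternating K-medoids loop with a k-medoids++-style seeding using only the Python standard library. It is an illustration under stated assumptions, not the implementation used in this study: the text does not name the library, and the function names, the toy 2D data, and the seed are hypothetical. The parameter names in Table 15 resemble those of the `KMedoids` class in scikit-learn-extra, but that correspondence is an assumption.

```python
import math
import random


def init_plus_plus(points, k, rng):
    """Seed medoids so that later seeds tend to lie far from earlier ones."""
    medoids = [rng.randrange(len(points))]
    while len(medoids) < k:
        # Weight each point by its squared distance to the nearest seed so far.
        d2 = [min(math.dist(p, points[m]) for m in medoids) ** 2 for p in points]
        medoids.append(rng.choices(range(len(points)), weights=d2, k=1)[0])
    return medoids


def k_medoids(points, k, max_iter=500, seed=0):
    """Alternate between assigning points and re-picking each cluster's medoid."""
    rng = random.Random(seed)  # plays the role of "random_state"
    medoids = init_plus_plus(points, k, rng)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimizes total distance within its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(math.dist(points[c], points[j]) for j in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break  # converged before reaching max_iter
        medoids = new_medoids
    return sorted(medoids)


# Toy 2D data for illustration; the study clusters points of the feature space.
points = [(float(x), float(y)) for x in range(6) for y in range(6)]
medoid_idx = k_medoids(points, k=4)
centers = [points[i] for i in medoid_idx]
print(centers)
```

Because every medoid is an actual member of the data, the cluster centers can be taken directly from an existing database without running new simulations, which is why K-medoids suits this sampling task.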

AI Model Training
Besides the continuous adjustment of the hyperparameters of a model with four inputs and one output, the prediction ability of the AI model relies on the quality and quantity of the data. Given the limited quantity, the previous section improved the quality of the training dataset. The new training set is validated in this section.
Before applying new algorithms, the selection of the ANN should be validated against other known AI models. Kou et al. [4] and Su [6] reported the rather high prediction capability of Support Vector Regression (SVR) and Kernel Ridge Regression (KRR). Table 16 compares the prediction performance of different algorithms. To reiterate the learning task, there were 200 training data points selected with K-medoids, 9000+ testing data points, 4 inputs, and 1 output. The data for all models were preprocessed using the robust scaler. In Table 16, two ANN models are presented using different solvers. For ANN-1, the solver is 'Adam', while for ANN-2, the solver is 'L-BFGS'. Both solvers are commonly used for ANNs. However, 'L-BFGS' can be more effective for small datasets [29]. The average training difference for each algorithm is small, which indicates that all algorithms have been trained successfully without underfitting. Based on the average testing difference, the ANN outperforms SVR and KRR. Therefore, the ANN is explored in greater depth. Table 17 illustrates the performance of ANN models with different solvers, hidden layers, and neurons.
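The robust scaler applied to all models can be illustrated with a short sketch. It assumes the common definition, in which each feature value has the median subtracted and is then divided by the interquartile range; Equation (5) is not reproduced in this section, so the exact form used in the study is an assumption here, and the sample values are made up.

```python
from statistics import median, quantiles


def robust_scale(values):
    """Scale values as (x - median) / (Q3 - Q1), which limits outlier influence."""
    # "inclusive" quartiles interpolate linearly between data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    center = median(values)
    iqr = q3 - q1
    return [(v - center) / iqr for v in values]


# Made-up feature column with one outlier, for illustration only.
scaled = robust_scale([1.0, 2.0, 3.0, 4.0, 100.0])
print(scaled)
```

The outlier barely affects the median and the quartiles, so the remaining values keep a sensible scale, which is the motivation for quartile-based standardization.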
According to Table 17, ANN-2 is the best-performing model in the comparison. It is also clear that adjusting the important hyperparameters of the ANN has a limited impact on the average testing difference, so tuning a single ANN model cannot improve performance much further. On the other hand, the time cost of training on small datasets is much lower than that of training on large datasets, which is one of the advantages of small datasets. The new training dataset has greatly improved the performance of the AI model: the maximum testing difference has improved significantly, and the number of test points with failed predictions has decreased significantly. Even though the performance is already quite good, there is room for further improvement with the help of ensemble learning.

Ensemble Learning
Ensemble learning shows high potential for improving prediction accuracy on small datasets [30]. This study presents only some preliminary results. Voting methods, or weighted averaging, have long been effective means of reducing system variance. That is also the core idea of "bagging". As mentioned earlier, the performance of a single ANN model is stable on the testing data. Although their numerical metrics are close, different ANN models often have distinct areas of inaccurate prediction.
Table 17 presents the performance of some existing ANN models from the grid search. In fact, many of the ANN models are well trained. Although individual ANN models may have limited performance, aggregating them can enhance performance. For each hidden-layer configuration, we selected the ANN models with excellent performance on the testing set as the sub-models. All sub-models in this study have equal weights. Table 19 shows the performance results with 5 and 15 sub-models, and the final comparison results are shown in Table 20. The comparison shows that both K-medoids and ensemble learning improve prediction performance.
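The equal-weight aggregation of sub-models described above amounts to averaging their predictions point by point. The sketch below uses made-up predictions from three hypothetical sub-models to show how errors in different directions can cancel out.

```python
def ensemble_predict(sub_model_predictions):
    """Average the predictions of all sub-models with equal weights."""
    n_models = len(sub_model_predictions)
    return [sum(column) / n_models for column in zip(*sub_model_predictions)]


# Made-up fatigue-life predictions (cycles) from three sub-models at two test points.
predictions = [
    [1000.0, 1500.0],
    [1020.0, 1470.0],
    [980.0, 1530.0],
]
averaged = ensemble_predict(predictions)
print(averaged)
```

Because different sub-models tend to be inaccurate in different regions, the average is usually closer to the target than the worst individual sub-model at any given test point.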

Conclusions
Solder joint fatigue is one of the key reliability concerns for advanced electronic products, and identifying the mean time to failure of solder joints is very time-consuming and expensive. The development time depends on the developer's simulation expertise, fundamental and domain knowledge, and experience, and it usually takes several months to several years to determine the mean time to failure of solder joints. Applying AI-assisted Design-of-Simulation (AI-DoS) technology enables fast, accurate, consistent, and reliable prediction, as indicated in Figure 2. However, obtaining a high-quality trained AI model from as little data as possible becomes a critical issue.
This study demonstrates the importance of training data quality and the immense potential of ensemble learning. Both core strategies of data sampling (the space-filling method and the adaptive method) have their merits.

Figure 2. The workflow of AI-assisted Design of Simulation.

Figure 3. Different sampling techniques [7]. (a) Space-filling design; (b) Adaptive design. The black dots represent the initial sample points, while the red squares indicate the subsequent sample points.

Markus's work [8] demonstrated significant improvement in machine learning training results by using adaptive sampling techniques to generate data. With the aim of reducing training data without significantly affecting training results, this study will also utilize adaptive sampling techniques. Additionally, the choice of machine learning algorithms is crucial. Referring to the process depicted in Figure 2, adaptive design is incorporated into the data generation module; this research selects an artificial neural network (ANN) and introduces ensemble learning in the training algorithm module. Dong et al. [9] conducted a comprehensive review of mainstream approaches for ensemble learning. In essence, ensemble learning combines multiple weakly supervised models to achieve a stronger and more comprehensive supervised model. As an extra machine learning strategy, this research will focus on data generation and not delve extensively into ensemble learning.

In summary, to reduce data sampling costs and model training time, this research will explore the feasibility of accurately predicting the reliability of advanced packaging using a small dataset. In fact, small data analysis has been applied across various fields. Izonin et al. [10] improved the performance of small data analysis in the field of biomedical engineering using improved neural network techniques. Zhang [11] achieved precise prediction of the target problems in the field of materials science by incorporating prior knowledge and using Kernel Ridge Regression (KRR) in cases with small datasets. These studies often focus on optimizing learning algorithms due to the uncontrollable nature of data acquisition. This paper, however, focuses on data sampling techniques because it uses simulation to generate datasets. This study will be optimized based on the process shown in Figure 2. Adaptive sampling techniques will be used to establish the dataset for WLCSP. The learning algorithm will consistently use ANNs, and the trained models will be ensemble models to predict the final reliability.
incremental plastic strain on the x, y, and z axes, respectively. $\Delta\gamma_{xy}^{pl}$, $\Delta\gamma_{yz}^{pl}$, and $\Delta\gamma_{xz}^{pl}$ represent the incremental shear strain on the surfaces of xy, yz, and xz. In practical calculations, finite element analysis can obtain the equivalent plastic strain of each thermal cycle, and the difference within a thermal cycle is denoted as $\Delta\varepsilon_{eq}^{pl}$.

The main components include input, hidden, and output layers [16]. The circles in Figure 4 represent the basic units of the ANN, called neurons.
Materials 2024, 17, x FOR PEER REVIEW 5 of 20

Figure 4. The schematic illustration of an ANN.

The structure diagram of a single neuron with four inputs is shown in Figure 5. The inputs on the left are multiplied by the corresponding weights, and the result is fed into the specified activation function to compute the output on the right. Clearly, the values of the weights directly affect the result of the loss function. Artificial neural networks have achieved efficient iteration and automatically updated their weights using the backpropagation algorithm. This has led to ANNs becoming one of the most popular machine learning algorithms.

Figure 5. A diagram of a neuron.

Figure 6. The visualization results of both algorithms. The red circles represent the cluster centers, and the three colors denote the three clusters.

Figure 8. The top view of the WLCSP structure. The red line represents the half-diagonal, and the circles represent solder balls.

Figure 10. The shape of the solder ball estimated by Surface Evolver.

Figure 11. A detailed view of the FEA model.

Figure 12. The mesh size of the key region.
Thermal cycling loads are applied to the WLCSP model according to JEDEC standard Condition G, with temperatures ranging from −40 to 125 °C. The thermal cycling load has a dwell time of 10 min at each temperature extreme and a heating and cooling rate of 16.5 °C/min. It takes 40 min to complete one cycle. See Figure 14 for details.

Figure 16. The visual distribution of the 200 training data points (random pick).

Figure 17. The visual distribution of new training data points (K-medoids).

Table 1. Two main differences between two algorithms.

Table 2. Linear elastic material parameters for WLCSP. The stress-strain curve for SAC305.

Table 4. The comparison of five TVs' fatigue life.

Table 5. Feature values of 256 data points.

Table 6. Feature values of 625 data points.

Table 7. Feature values of 1296 data points.

Table 8. Feature values of 2401 data points.

Table 9. Feature values of 4096 data points.

Table 10. Feature values of new 1296 data points.

Table 11. The hyperparameters of the ANN.

Table 12. The performance of two ANN models.

Table 13. Data distribution of inaccurately predicted test points.

Table 14. The partitioning of the feature values.

Table 16. The performance results of different algorithms.

Table 17. The performance results of different ANN models.

Table 18 compares the prediction performance of ANN models trained with the old and new training data. "Random pick" is represented by Model II, while "K-Medoids" is represented by ANN-2. The comparison demonstrates that the adaptive sampling method is useful.

Table 18. The performance of ANN models with 2 different training datasets.

Table 19. The performance results of ensemble learning.

Table 21. The final comparison results.